{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Table of Contents\n",
    "* [Intro](#Intro)\n",
    "\t* [Time-Series](#Time-Series)\n",
    "* [Moving Average](#Moving-Average)\n",
    "* [Filling missing values](#Filling-missing-values)\n",
    "\t* [Interpolation](#Interpolation)\n",
    "* [Stationarity [TOFIX]](#Stationarity-[TOFIX])\n",
    "\t* &nbsp;\n",
    "\t\t* [Check on more complex example Time Series](#Check-on-more-complex-example-Time-Series)\n",
    "* [Correlation](#Correlation)\n",
    "* [Arima](#Arima)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Intro"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notebook that explores time-series and techniques to analyze them.\n",
    "\n",
    "Resources:\n",
    "* [Data Analysis with Open Source Tools](http://shop.oreilly.com/product/9780596802363.do)\n",
    "* [Think Stats 2e - Allen B. Downey](http://greenteapress.com/wp/think-stats-2e/)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Time-Series"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\"A time-series is sequence of measurements from a system that varies in time\"\n",
    "\n",
    "A time-series is generally decomposed in three major components:\n",
    "* **Trend**: persistent change along time.\n",
    "* **Seasonality**: regular periodic variation. There can be multiple seasonalities, and each can span different time-frames (by day, week, month, year, etc.)\n",
    "* **Noise**: random variation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-08-21T09:37:29.394451",
     "start_time": "2017-08-21T09:37:29.371450"
    },
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%matplotlib notebook\n",
    "\n",
    "import numpy as np\n",
    "import seaborn as sns\n",
    "import pandas as pd\n",
    "\n",
    "sns.set_context(\"paper\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Moving Average"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Moving Averages (also rolling/running average or moving/running mean) is a technique that helps to extract the trend from a series. It reduces noise and decreases impact of outliers. Is consists in dividing the series in overlapping windows of fixed size $N$, and for each considering the average value. It follows that the first $N-1$ values will be undefined, given that they don't have enough predecessors to compute the average.\n",
    "\n",
    "**Exponentially-Weighted Moving Average (EWMA)** is an alternative that gives more importance to recent values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-08-19T09:46:50.737574",
     "start_time": "2017-08-19T09:46:50.726574"
    },
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# rolling mean basic example \n",
    "series = np.arange(10)\n",
    "pd.Series(series).rolling(3).mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-08-19T09:57:08.837928",
     "start_time": "2017-08-19T09:57:08.830927"
    },
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# ewma basic example \n",
    "series = np.arange(10)\n",
    "pd.Series(series).ewm(3).mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-08-21T09:42:46.138568",
     "start_time": "2017-08-21T09:42:46.126567"
    },
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# ewm on partial long series of 0s\n",
    "series = [1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0]\n",
    "pd.Series(series).ewm(span=2).mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-08-21T09:42:56.021133",
     "start_time": "2017-08-21T09:42:55.958129"
    },
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "pd.Series(series).ewm(span=2).mean().plot()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Filling missing values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Basic Pandas methods:     \n",
    "* pad/ffill: fill values forward  \n",
    "* bfill/backfill: fill values backward"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Random arrays to play with\n",
    "a = np.arange(20)\n",
    "b = a*a\n",
    "b_empty = np.array(a*a).astype('float')\n",
    "# Add missing values and get a Pandas Series\n",
    "b_empty[[0, 5, 6, 15]] = np.nan\n",
    "c = pd.Series(b_empty)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Visualize how filling method works\n",
    "fig, axes = sns.plt.subplots(2)\n",
    "sns.pointplot(np.arange(20), c, ax=axes[0])\n",
    "sns.pointplot(np.arange(20), c.fillna(method='bfill'), ax=axes[1])\n",
    "sns.plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Interpolation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.interpolate.html\n",
    "\n",
    "'linear' ignores the index, ‘time’: interpolation works on daily and higher resolution data to interpolate given length of interval"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Stationarity [TOFIX]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Link](https://www.analyticsvidhya.com/blog/2016/02/time-series-forecasting-codes-python/)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "df = pd.read_csv(\"time_series.csv\")\n",
    "df['datetime'] = pd.to_datetime(df['datetime'])\n",
    "df.set_index(\"datetime\", inplace=True)\n",
    "df.fillna(method='pad', axis=0, inplace=True)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#Determing rolling statistics\n",
    "rolmean = pd.rolling_mean(new_df, window=12)\n",
    "rolstd = pd.rolling_std(new_df, window=12)\n",
    "\n",
    "#Plot rolling statistics:\n",
    "sns.plt.plot(new_df, color='blue',label='Original')\n",
    "sns.plt.plot(rolmean, color='red', label='Rolling Mean')\n",
    "sns.plt.plot(rolstd, color='black', label = 'Rolling Std')\n",
    "sns.plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "new_df = df.copy()\n",
    "new_df['val'] = new_df['val'] - pd.ewma(new_df, halflife=12)['val']\n",
    "new_df['val'] = new_df['val'] - pd.ewma(new_df, halflife=12)['val']\n",
    "new_df.plot()\n",
    "sns.plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Check on more complex example Time Series"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "df = pd.read_csv(\"time_series.csv\")\n",
    "df['datetime'] = pd.to_datetime(df['datetime'])\n",
    "df.set_index(\"datetime\", inplace=True)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "df.plot()\n",
    "sns.plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "df.fillna(method='pad', axis=0).plot()\n",
    "sns.plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "null_indexes = [i for i, isnull in enumerate(pd.isnull(df['val'].values)) if isnull]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# missing_values_correct \n",
    "y = [32.69,32.15,32.61,29.3,28.96,28.78,31.05,29.58,29.5,30.9,31.26,31.48,29.74,29.31,29.72,28.88,30.2,27.3,26.7,27.52]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "filled = df['val'].interpolate(method='time').values\n",
    "predict = filled[null_indexes]\n",
    "len(predict)==len(y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "d = sum([abs((y[i]-predict[i])/y[i]) for i in range(len(y))])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "d"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Correlation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Link](https://anomaly.io/understand-auto-cross-correlation-normalized-shift)\n",
    "\n",
    "There exists different methods to analyze correlation for time-series. If we compare two different time series we are talking about **cross-correlation**, while in **auto-correlation** a time-series is compared with itself (can detect seasonality). Both of the previous mentioned categories can use normalization (useful for example when series characterized by different scales, also good for values = zero).\n",
    "\n",
    "Correlation between two time-series $y$ and $x$ is defined as\n",
    "\n",
    "$$ corr(x, y) = \\sum_{n=0}^{n-1} x[n]*y[n] $$\n",
    "\n",
    "while normalized correlation is defined as \n",
    "\n",
    "$$norm\\_corr(x,y)=\\dfrac{\\sum_{n=0}^{n-1} x[n]*y[n]}{\\sqrt{\\sum_{n=0}^{n-1} x[n]^2 * \\sum_{n=0}^{n-1} y[n]^2}}$$\n",
    "\n",
    "For auto-correlation we shift the time-series by an interval called **lag**, and then compare the shifted version with the original one to understand the strength of the correlation (process sometime also called serial-correlation, especially when lag=1).\n",
    "The idea is that a series values are not random independent event, but should have some level of dependency with preceding values. This dependency is the pattern we are trying to discover.\n",
    "\n",
    "Suggestions: check correlation after removing the trend. Understand the seasonality more appropriate for your case."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-08-17T13:54:29.101353",
     "start_time": "2017-08-17T13:54:29.095352"
    },
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "a = np.array([1,2,-2,4,2,3,1,0])\n",
    "b = np.array([2,3,-2,3,2,4,1,-1])\n",
    "c = np.array([-2,0,4,0,1,1,0,-2])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-08-17T13:54:29.636383",
     "start_time": "2017-08-17T13:54:29.632383"
    },
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(\"a and b correlate value = {}\".format(np.correlate(a, b)[0]))\n",
    "print(\"a and c correlate value = {}\".format(np.correlate(a, c)[0]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-08-17T13:54:39.173929",
     "start_time": "2017-08-17T13:54:39.168928"
    },
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def normalized_cross_correlation(a, v):\n",
    "    # cross-correlation is simply the dot product of our arrays\n",
    "    cross_cor = np.dot(a, v)\n",
    "    norm_term = np.sqrt(np.sum(a**2) * np.sum(v**2))\n",
    "    return cross_cor/norm_term"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-08-17T13:54:40.033978",
     "start_time": "2017-08-17T13:54:40.023977"
    },
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "normalized_cross_correlation(a, c)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-08-17T13:55:07.366541",
     "start_time": "2017-08-17T13:55:07.362541"
    },
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(\"a and a/2 correlate value = {}\".format(np.correlate(a, a/2)[0]))\n",
    "print(\"a and a/2 normalized correlate value = {}\".format(normalized_cross_correlation(a, a/2)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Arima"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [conda env:kaggle]",
   "language": "python",
   "name": "conda-env-kaggle-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  },
  "toc": {
   "colors": {
    "hover_highlight": "#DAA520",
    "running_highlight": "#FF0000",
    "selected_highlight": "#FFD700"
   },
   "moveMenuLeft": true,
   "nav_menu": {
    "height": "155px",
    "width": "252px"
   },
   "navigate_menu": true,
   "number_sections": true,
   "sideBar": true,
   "threshold": 4,
   "toc_cell": false,
   "toc_section_display": "block",
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
